knitr::opts_chunk$set(echo = TRUE, include = TRUE,fig.align = "center")
options(scipen = 999, knitr.kable.NA = "")
# Install required packages (if not already installed)
##install.packages("tidyverse")
##install.packages("lubridate")
##install.packages("readxl")
##install.packages("skimr")
##install.packages("magrittr")
##install.packages("tidyquant")
##install.packages("tsibble")
##install.packages("feasts")
##install.packages("ggcorrplot")
##install.packages("glmnet")
##install.packages("caret")
##install.packages("rattle")
# Importing libraries
library(tidyverse)
library(lubridate)
library(readxl)
library(skimr)
library(magrittr)
library(tidyquant)
library(tsibble)
library(feasts)
library(ggcorrplot)
library(glmnet)
library(caret)
library(rattle)
set.seed(1234)
Electricity price, consumption, and production data are imported from a CSV file.
US dollars are chosen as the prediction currency because the dollar is one of the most widely used, generally accepted, and stable currencies in the world. Other currencies, especially TRY, may fluctuate greatly over the years due to factors irrelevant to this analysis, such as inflation and interest rates.
df <- read_csv("mcp_with_variables.csv") %>% select(-mcp_try, -mcp_euro)
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## date = col_date(format = ""),
## hour = col_double(),
## mcp_try = col_double(),
## mcp_dollars = col_double(),
## mcp_euro = col_double(),
## load_plan = col_double(),
## total_prod = col_double(),
## natural_gas = col_double(),
## wind = col_double(),
## lignite = col_double(),
## black_coal = col_double(),
## import_coal = col_double(),
## fuel_oil = col_double(),
## geothermal = col_double(),
## dam = col_double(),
## naphta = col_double(),
## biomass = col_double(),
## river = col_double(),
## other = col_double()
## )
head(df)
## # A tibble: 6 x 17
## date hour mcp_dollars load_plan total_prod natural_gas wind lignite
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015-06-01 0 39.0 25600 20657. 6198. 169. 2799
## 2 2015-06-01 1 33.1 23900 20292. 6391. 166. 2799
## 3 2015-06-01 2 18.4 23000 19666. 6403. 172. 2799
## 4 2015-06-01 3 12.7 22600 19650. 6388. 182. 2799
## 5 2015-06-01 4 10.9 22500 19664. 6300. 189. 2799
## 6 2015-06-01 5 11.3 21900 20028. 6591. 196. 2799
## # … with 9 more variables: black_coal <dbl>, import_coal <dbl>, fuel_oil <dbl>,
## # geothermal <dbl>, dam <dbl>, naphta <dbl>, biomass <dbl>, river <dbl>,
## # other <dbl>
Summary statistics of the variables are shown below.
mcp_dollars has some extreme values (p100 = 542). The histograms also suggest that load_plan and total_prod have similar distributions.
skim(df)
| Name | df |
| Number of rows | 53112 |
| Number of columns | 17 |
| _______________________ | |
| Column type frequency: | |
| Date | 1 |
| numeric | 16 |
| ________________________ | |
| Group variables | None |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2015-06-01 | 2021-06-21 | 2018-06-11 | 2213 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| hour | 0 | 1 | 11.50 | 6.92 | 0.00 | 5.75 | 11.50 | 17.25 | 23.00 | ▇▇▆▇▇ |
| mcp_dollars | 0 | 1 | 45.20 | 14.36 | 0.00 | 38.54 | 45.70 | 54.66 | 542.00 | ▇▁▁▁▁ |
| load_plan | 0 | 1 | 32657.61 | 5290.13 | 0.00 | 28723.75 | 32800.00 | 36500.00 | 46500.00 | ▁▁▃▇▃ |
| total_prod | 0 | 1 | 29530.99 | 4863.54 | 934.26 | 25849.19 | 29505.54 | 33134.57 | 42814.23 | ▁▁▃▇▂ |
| natural_gas | 0 | 1 | 8817.98 | 3684.93 | 13.50 | 6434.35 | 9044.28 | 11381.06 | 17928.17 | ▂▅▇▅▁ |
| wind | 0 | 1 | 1964.53 | 1453.06 | 14.99 | 769.02 | 1640.57 | 2887.93 | 7486.13 | ▇▅▃▁▁ |
| lignite | 0 | 1 | 3968.60 | 705.54 | 0.00 | 3514.00 | 4008.00 | 4464.15 | 6806.10 | ▁▁▇▇▁ |
| black_coal | 0 | 1 | 393.35 | 310.50 | 0.00 | 139.00 | 286.00 | 662.00 | 1179.00 | ▇▃▅▃▁ |
| import_coal | 0 | 1 | 5607.55 | 1552.21 | 330.00 | 4520.00 | 5590.00 | 7006.20 | 8112.00 | ▁▂▇▇▇ |
| fuel_oil | 0 | 1 | 109.17 | 182.38 | 0.00 | 1.10 | 47.60 | 115.00 | 1008.00 | ▇▁▁▁▁ |
| geothermal | 0 | 1 | 641.32 | 302.12 | 7.80 | 450.25 | 679.80 | 876.96 | 1173.94 | ▃▅▇▇▆ |
| dam | 0 | 1 | 5687.11 | 2815.37 | 23.00 | 3490.68 | 5552.30 | 7560.60 | 15942.88 | ▅▇▆▂▁ |
| naphta | 0 | 1 | 1.67 | 2.28 | 0.00 | 0.00 | 0.00 | 4.50 | 12.14 | ▇▂▂▁▁ |
| biomass | 0 | 1 | 238.81 | 178.79 | 0.00 | 93.94 | 234.52 | 378.17 | 663.21 | ▇▂▃▃▂ |
| river | 0 | 1 | 1825.34 | 1244.21 | 113.24 | 879.07 | 1403.61 | 2633.03 | 6016.68 | ▇▅▃▁▁ |
| other | 0 | 1 | 275.55 | 118.25 | 0.00 | 220.30 | 265.80 | 340.00 | 991.47 | ▂▇▁▁▁ |
Histogram and box plot of the mcp_dollars variable are shown.
ggplot(df, aes(x=mcp_dollars)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
boxplot(df$mcp_dollars)
Histograms of the load_plan and total_prod variables are shown. The two variables are distributed similarly and appear roughly normal around a mean value of about 30000.
ggplot(df, aes(x=load_plan)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(df, aes(x=total_prod)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Box plots of the load_plan and total_prod variables are shown.
boxplot(df$load_plan)
boxplot(df$total_prod)
Seasonality plots of the load_plan variable are shown for selected hours (0, 4, 8, 12, 16, and 20). There is generally some yearly trend pattern; however, the series also shows seasonal patterns at shorter periods, such as weekly or monthly.
as_tsibble(df %>% filter(hour == 0)) %>% gg_season(load_plan) + theme_tq()
## Using `date` as index variable.
as_tsibble(df %>% filter(hour == 4)) %>% gg_season(load_plan) + theme_tq()
## Using `date` as index variable.
as_tsibble(df %>% filter(hour == 8)) %>% gg_season(load_plan) + theme_tq()
## Using `date` as index variable.
as_tsibble(df %>% filter(hour == 12)) %>% gg_season(load_plan) + theme_tq()
## Using `date` as index variable.
as_tsibble(df %>% filter(hour == 16)) %>% gg_season(load_plan) + theme_tq()
## Using `date` as index variable.
as_tsibble(df %>% filter(hour == 20)) %>% gg_season(load_plan) + theme_tq()
## Using `date` as index variable.
A four-week line graph of MCP price is shown for every hour to check whether there is any seasonality in the variable. There is some weekly seasonality for certain hours, but no seasonality common to all hours of the day.
df %>%
filter(date >= as_date("2019-01-07") & date <= as_date("2019-02-04")) %>%
ggplot(aes(x = date, y = mcp_dollars)) +
geom_line(aes(color = as.factor(hour)), size = 0.9, show.legend = T) +
labs(title="", y="MCP Price (USD)", x="Date") +
facet_wrap(~hour, ncol=4) +
theme_tq()
An hourly MCP price line graph for every day of one week is shown to check whether there is any seasonality in the variable. Some days of the week follow similar patterns, but there is no exact pattern that repeats every day.
df %>%
filter(date >= as_date("2019-01-07") & date <= as_date("2019-01-13")) %>%
ggplot(aes(x = hour, y = mcp_dollars)) +
geom_line(aes(color = as.factor(date)), size = 0.9) +
labs(title="", y="MCP Price (USD)", x="Hour") +
theme_tq()
A four-week line graph of electricity usage (load_plan) is shown for every hour to check whether there is any seasonality in the variable. There is a clear weekly seasonality pattern: the same day of each week shows a similar shape, and on Sundays in particular there is a sharp decrease in electricity usage.
df %>%
filter(date >= as_date("2019-01-07") & date <= as_date("2019-02-04")) %>%
ggplot(aes(x = date, y = load_plan)) +
geom_line(aes(color = as.factor(hour)), size = 0.9, show.legend = T) +
labs(title="", y="Electricity Usage", x="Date") +
# facet_wrap(~hour, ncol=4) +
theme_tq()
An hourly electricity usage (load_plan) line graph for every day of one week is shown to check whether there is any seasonality. There is also a clear hourly seasonality pattern: the same hour of each day shows a similar value, and, as expected, electricity usage during the night hours is low.
df %>%
filter(date >= as_date("2019-01-07") & date <= as_date("2019-01-13")) %>%
ggplot(aes(x = hour, y = load_plan)) +
geom_line(aes(color = as.factor(date)), size = 0.9) +
labs(title="", y="Electricity Usage", x="Hour") +
theme_tq()
A four-week line graph of total electricity production (total_prod) is shown for every hour to check whether there is any seasonality. It shows characteristics similar to the electricity usage (load_plan) variable.
df %>%
filter(date >= as_date("2019-01-07") & date <= as_date("2019-02-04")) %>%
ggplot(aes(x = date, y = total_prod)) +
geom_line(aes(color = as.factor(hour)), size = 0.9, show.legend = T) +
labs(title="", y="Total Electricity Production", x="Date") +
# facet_wrap(~hour, ncol=4) +
theme_tq()
An hourly total electricity production (total_prod) line graph for every day of one week is shown to check whether there is any seasonality. It shows characteristics similar to the electricity usage (load_plan) variable.
df %>%
filter(date >= as_date("2019-01-07") & date <= as_date("2019-01-13")) %>%
ggplot(aes(x = hour, y = total_prod)) +
geom_line(aes(color = as.factor(date)), size = 0.9) +
labs(title="", y="Total Electricity Production", x="Hour") +
theme_tq()
New variables, such as lagged variables and ratio variables, are created to normalize the seasonality/trend effects. The renew_energy_ratio variable gives the share of production supplied by renewable energy sources, and the cons_prod_ratio variable shows how far total production meets demand. Finally, weekly (lag 168) and monthly (lag 672) lagged variables are created from both of these new variables and from the target variable.
df2 <- df %>%
mutate(renew_energy_ratio = (wind + geothermal + dam + biomass + river) / total_prod,
cons_prod_ratio = load_plan / total_prod,
mcp_dollars_lag168 = lag(mcp_dollars, 168),
renew_energy_ratio_lag168 = lag(renew_energy_ratio, 168),
cons_prod_ratio_lag168 = lag(cons_prod_ratio, 168),
mcp_dollars_lag672 = lag(mcp_dollars, 672),
renew_energy_ratio_lag672 = lag(renew_energy_ratio, 672),
cons_prod_ratio_lag672 = lag(cons_prod_ratio, 672)) %>%
select(date, hour, mcp_dollars, mcp_dollars_lag168, renew_energy_ratio_lag168, cons_prod_ratio_lag168,
mcp_dollars_lag672, renew_energy_ratio_lag672, cons_prod_ratio_lag672) %>%
filter(date >= as_date("2015-06-29"))
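As a minimal sketch of how the lagged features above behave, here is a base-R stand-in for `dplyr::lag` applied to a toy vector (not the electricity data): a lag of k positions shifts the series down by k rows and leaves NA in the first k rows, which is why the first 28 days (672 hours) are dropped by the date filter.

```r
# Base-R equivalent of dplyr::lag(x, k): shift x down by k positions,
# padding the first k entries with NA (toy illustration, not the real data).
lag_k <- function(x, k) c(rep(NA, k), head(x, -k))

x <- 1:10
lagged <- lag_k(x, 3)
# The first 3 entries are NA; the rest are the original values shifted by 3.
```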
Variable distributions, scatter plots, and histograms are drawn for one week of the selected variables.
GGally::ggpairs(df2 %>% filter(date >= as_date("2019-01-07") & date <= as_date("2019-01-13")) %>% select(-date, -hour)) + theme_tq()
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
Train and test data sets are created, and a repeated cross-validation control is defined.
df2_train <- df2 %>% filter(date < as_date("2021-05-22"))
df2_test <- df2 %>% filter(date >= as_date("2021-05-22"))
fitControl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 5)
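For reference, `repeatedcv` with `number = 5` and `repeats = 5` means each candidate model is evaluated on 5 × 5 = 25 held-out resamples; a minimal sketch of the arithmetic:

```r
# Repeated k-fold CV: each repeat re-partitions the data into k folds,
# so the total number of held-out performance estimates per model is k * repeats.
k <- 5        # number = 5 in fitControl
repeats <- 5  # repeats = 5 in fitControl
n_resamples <- k * repeats
```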
A linear regression model with repeated cross-validation (fitControl) is fitted using the caret package ("lm" method). The summary of the model is shown below. The R-squared value is not very high, so the explanatory power of this model seems low at first glance.
lin_reg <- train(mcp_dollars ~ .,
data = df2_train %>% select(-date, -hour),
method = "lm",
trControl = fitControl,
tuneLength = 5)
summary(lin_reg)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -226.63 -4.09 0.77 4.78 479.82
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.481142 0.969461 17.000 < 0.0000000000000002 ***
## mcp_dollars_lag168 0.475283 0.004090 116.210 < 0.0000000000000002 ***
## renew_energy_ratio_lag168 -4.660313 0.639829 -7.284 0.000000000000329 ***
## cons_prod_ratio_lag168 -1.445605 0.661822 -2.184 0.028946 *
## mcp_dollars_lag672 0.250899 0.004071 61.635 < 0.0000000000000002 ***
## renew_energy_ratio_lag672 5.528294 0.646590 8.550 < 0.0000000000000002 ***
## cons_prod_ratio_lag672 -2.537103 0.660776 -3.840 0.000123 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.07 on 51689 degrees of freedom
## Multiple R-squared: 0.4172, Adjusted R-squared: 0.4171
## F-statistic: 6167 on 6 and 51689 DF, p-value: < 0.00000000000000022
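As a quick reminder of what the R-squared above measures, here is a base-R sketch on a small synthetic data set (hypothetical, not the electricity data): the R-squared reported by `summary(lm)` equals 1 − RSS/TSS.

```r
# R-squared = 1 - RSS/TSS, checked on toy data.
set.seed(1)
toy <- data.frame(x = 1:50)
toy$y <- 2 * toy$x + rnorm(50, sd = 5)

fit <- lm(y ~ x, data = toy)
rss <- sum(resid(fit)^2)             # residual sum of squares
tss <- sum((toy$y - mean(toy$y))^2)  # total sum of squares
r2_manual <- 1 - rss / tss
```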
A decision tree model with repeated cross-validation (fitControl) is fitted using the caret package ("rpart" method). The RMSE vs complexity parameter plot is shown, and the model with the lowest CP and RMSE is chosen as the final decision tree. The tree plot shows that only the lag variables of mcp_dollars are used in the best decision tree.
dec_tree <- train(mcp_dollars ~ .,
data = df2_train %>% select(-date, -hour),
method = "rpart",
trControl = fitControl,
tuneLength = 5)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
trellis.par.set(caretTheme())
plot(dec_tree)
fancyRpartPlot(dec_tree$finalModel)
A random forest model with repeated cross-validation (fitControl) is fitted using the caret package ("ranger" method). Impurity is used as the importance measure, and the number of trees is set to 50 for performance reasons. As the results table and final model below show, the best random forest uses mtry = 6 with the "extratrees" split rule, with an RMSE of about 9.11 and an R-squared of around 60%, better than the other methods (linear regression and decision tree).
random_forest <- train(mcp_dollars ~ .,
data = df2_train %>% select(-date, -hour),
method = "ranger",
trControl = fitControl,
num.trees = 50,
importance = "impurity")
plot(random_forest)
random_forest$results
## mtry min.node.size splitrule RMSE Rsquared MAE RMSESD
## 1 2 5 variance 9.112124 0.6070154 5.826402 0.4763914
## 2 2 5 extratrees 9.176268 0.6046028 5.898776 0.5065473
## 3 4 5 variance 9.185223 0.5993560 5.878185 0.4714967
## 4 4 5 extratrees 9.107398 0.6078719 5.860768 0.4679191
## 5 6 5 variance 9.304511 0.5884474 5.920130 0.4976521
## 6 6 5 extratrees 9.105426 0.6072850 5.869751 0.4224447
## RsquaredSD MAESD
## 1 0.01897971 0.05536528
## 2 0.02121679 0.05789229
## 3 0.01935909 0.05948841
## 4 0.01932822 0.05861577
## 5 0.02198343 0.06275586
## 6 0.01598648 0.05713555
random_forest$finalModel
## Ranger result
##
## Call:
## ranger::ranger(dependent.variable.name = ".outcome", data = x, mtry = min(param$mtry, ncol(x)), min.node.size = param$min.node.size, splitrule = as.character(param$splitrule), write.forest = TRUE, probability = classProbs, ...)
##
## Type: Regression
## Number of trees: 50
## Sample size: 51696
## Number of independent variables: 6
## Mtry: 6
## Target node size: 5
## Variable importance mode: impurity
## Splitrule: extratrees
## Number of random splits: 1
## OOB prediction error (MSE): 84.31614
## R squared (OOB): 0.5988944
Candidate models are compared below. The random forest model clearly has the highest R-squared and the lowest (best) MAE and RMSE values, so it is selected as the final model.
results = resamples(list(Linear_Regression = lin_reg, Decision_Tree = dec_tree, Random_Forest = random_forest))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: Linear_Regression, Decision_Tree, Random_Forest
## Number of resamples: 25
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Linear_Regression 6.891402 6.994913 7.027625 7.035182 7.064994 7.272792 0
## Decision_Tree 7.169201 7.264244 7.294622 7.299560 7.336142 7.423114 0
## Random_Forest 5.748468 5.836326 5.870224 5.869751 5.921145 5.959173 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## Linear_Regression 10.155073 10.659505 11.025772 11.068541 11.552270 12.21379
## Decision_Tree 10.675618 10.911633 11.362176 11.362024 11.649588 12.22990
## Random_Forest 8.585016 8.766165 9.051721 9.105426 9.248408 10.30312
## NA's
## Linear_Regression 0
## Decision_Tree 0
## Random_Forest 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## Linear_Regression 0.3570434 0.4001997 0.4207127 0.4178321 0.4404207 0.4639462
## Decision_Tree 0.3536464 0.3705948 0.3818625 0.3861824 0.4073355 0.4226206
## Random_Forest 0.5594306 0.5982082 0.6093993 0.6072850 0.6148209 0.6377483
## NA's
## Linear_Regression 0
## Decision_Tree 0
## Random_Forest 0
bwplot(results)
Prediction and residual analysis for the training set of the random forest model is shown. The actual vs predicted plot shows that predicted values are generally close to the actual values, with only a few strongly deviating observations. The histogram of the residuals is roughly normal around a mean of 0, which is good for the model. Finally, the predicted vs residuals plot shows residuals scattered around 0; the few deviating values are not numerous enough to be a concern.
rf_predicted <- predict(random_forest, newdata = df2_train)
plot(rf_predicted, df2_train$mcp_dollars, xlab = "Predicted", ylab = "Actual", main = "Actual vs Predicted Plot for Random Forest Model")
abline(a=0,b=1,col='red', lty = 2)
rf_residuals <- df2_train$mcp_dollars - rf_predicted
hist(rf_residuals, xlab = "Residuals", main = "Residuals Histogram of Random Forest Model")
plot(rf_predicted, rf_residuals, xlab = "Predicted", ylab = "Residuals", main = "Predicted vs Residuals Plot for Random Forest Model")
abline(h = 0, col = "red", lty = 2)
The variable importance of the random forest model is shown. mcp_dollars_lag168 is the most important variable, followed by mcp_dollars_lag672 and renew_energy_ratio_lag168.
sort(random_forest$finalModel$variable.importance, decreasing = TRUE)
## mcp_dollars_lag168 mcp_dollars_lag672 renew_energy_ratio_lag168
## 3792721.6 2232439.8 1326302.5
## renew_energy_ratio_lag672 cons_prod_ratio_lag672 cons_prod_ratio_lag168
## 1002438.9 909557.9 850075.4
For the test period, the target variable (mcp_dollars) is forecast using the predict function. Prediction and residual analysis for the test set of the random forest model is shown. As on the training set, the predicted values are generally close to the actual values with only a few strongly deviating observations, the residual histogram is roughly normal around 0, and the predicted vs residuals plot shows residuals scattered around 0.
rf_predicted_test <- predict(random_forest$finalModel, data = df2_test)
plot(rf_predicted_test$predictions, df2_test$mcp_dollars, xlab = "Predicted", ylab = "Actual", main = "Actual vs Predicted Plot for Random Forest Model with Test Set")
abline(a=0,b=1,col='red', lty = 2)
rf_residuals_test <- df2_test$mcp_dollars - rf_predicted_test$predictions
hist(rf_residuals_test, xlab = "Residuals", main = "Residuals Histogram of Random Forest Model")
plot(rf_predicted_test$predictions, rf_residuals_test, xlab = "Predicted", ylab = "Residuals", main = "Predicted vs Residuals Plot for Random Forest Model with Test Set")
abline(h = 0, col = "red", lty = 2)
Lastly, the RMSE of the test-set predictions is shown. It is not a high value, which is also good for the model.
RMSE(df2_test$mcp_dollars, rf_predicted_test$predictions)
## [1] 5.292324
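The RMSE reported above is simply the square root of the mean squared error. A base-R sketch on toy vectors (not the model output) illustrates the definition that caret's RMSE function implements:

```r
# RMSE = sqrt(mean((actual - predicted)^2)); toy vectors for illustration.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

actual    <- c(3, 5, 7)
predicted <- c(2, 5, 9)
err <- rmse(actual, predicted)  # sqrt((1 + 0 + 4) / 3)
```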